Search Results: "Steinar H. Gunderson"

21 March 2017

Steinar H. Gunderson: 10-bit H.264 support

Following my previous tests about 10-bit H.264, I did some more practical tests; since media.xiph.org is up again, I did some tests with actual 10-bit input. The results were pretty similar, although of course 4K 60 fps organic content is going to be different at times from the partially rendered 1080p 24 fps clip I used. But I also tested browser support, with good help from people on IRC. It was every bit as bad as I feared: Chrome on desktop (Windows, Linux, macOS) supports 10-bit H.264, although of course without hardware acceleration. Chrome on Android does not. Firefox does not (it tries on macOS, but playback is buggy). iOS does not. VLC does; I didn't try a lot of media players, but obviously ffmpeg-based players should do quite well. I haven't tried Chromecast, but I doubt it works. So I guess that yes, it really is a choice between 8-bit H.264 and 10-bit HEVC, though I haven't tested the latter yet either :-)

9 March 2017

Steinar H. Gunderson: Tired

To be honest, at this stage I'd actually prefer ads in Wikipedia to having ever more intrusive begging for donations. Please go away soon.

27 February 2017

Steinar H. Gunderson: 10-bit H.264 tests

Following the post about 10-bit Y'CbCr earlier this week, I thought I'd make an actual test of 10-bit H.264 compression for live streaming. The basic question is: sure, it's better-per-bit, but it's also slower, so is it better-per-MHz? This is largely inspired by Ronald Bultje's post about streaming performance, where he largely showed that HEVC is currently useless for live streaming from software; unless you can encode at x264's veryslow preset (which, at 720p60, means basically rather simple content and 20 cores or so), the best x265 presets you can afford will give you worse quality than the best x264 presets you can afford. My results will maybe not be as scientific, but hopefully still enlightening.

I used the same test clip as Ronald, namely a two-minute clip of Tears of Steel. Note that this is an 8-bit input, so we're not testing the effects of 10-bit input; it's just testing the increased internal precision in the codec. Since my focus is practical streaming, I ran the latest version of x264 at four threads (a typical desktop machine), using one-pass encoding at 4000 kbit/sec. Nageru's speed control has 26 presets to choose from, which gives pretty smooth steps between neighboring ones, but I've been sticking to the ten standard x264 presets (ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow, placebo).

Here's the graph: the x-axis is seconds used for the encode (note the logarithmic scale; placebo takes 200-250 times as long as ultrafast), and the y-axis is SSIM dB, so up and to the left is better. The blue line is 8-bit, and the red line is 10-bit. (I ran most encodes five times and averaged the results, but it doesn't really matter, due to the logarithmic scale.) The results are actually much stronger than I assumed; if you run on (8-bit) ultrafast or superfast, you should stay with 8-bit, but from there on, 10-bit is on the Pareto frontier. Actually, 10-bit veryfast (18.187 dB) is better than 8-bit medium (18.111 dB), while being four times as fast!

But not all of us have a relation to dB quality, so I chose to also do a test that maybe is a bit more intuitive, centered around the bitrate needed for constant quality. I locked quality to 18 dB, ie., for each preset, I adjusted the bitrate until the SSIM showed 18.000 dB plus/minus 0.001 dB. (Note that this means faster presets get less of a speed advantage, because they need higher bitrate, which means more time spent entropy coding.) Then I measured the encoding time (again five times) and graphed the results: the x-axis is again seconds, and the y-axis is the bitrate needed in kbit/sec, so lower and to the left is better. Blue is again 8-bit and red is again 10-bit.

If the previous graph was enough to make me intrigued, this is enough to make me excited. In general, 10-bit gives 20-30% lower bitrate for the same quality and CPU usage! (Compare this with the supposed up-to-50% benefits of HEVC over H.264, given infinite CPU usage.) The most dramatic example is when comparing the medium presets directly, where 10-bit runs at 2648 kbit/sec versus 3715 kbit/sec (29% lower bitrate!) and is only 5% slower. As one progresses towards the slower presets, the gap narrows somewhat (placebo is 27% slower and only 24% lower bitrate), but in the realistic middle range, the difference is quite marked. If you run 3 Mbit/sec at 10-bit, you get the quality of 4 Mbit/sec at 8-bit.

So is 10-bit H.264 a no-brainer? Unfortunately, no; the client hardware support is nearly nil.
Not even Skylake, which can do 10-bit HEVC encoding in hardware (and 10-bit VP9 decoding), can do 10-bit H.264 decoding in hardware. Worse still, mobile chipsets generally don't support it. There are rumors that the iPhone 6s supports it, but these are unconfirmed; some Android chips support it, but most don't. I guess this explains a lot of the limited uptake; since it's in some ways a new codec, implementers are more keen to get the full benefits of HEVC instead (even though the licensing situation is really icky). The only ones I know of that have really picked it up as a distribution format are the anime scene, and they're feeling quite specific pains due to unique content (large gradients giving pronounced banding in undithered 8-bit). So, 10-bit H.264: It's awesome, but you can't have it. Sorry :-)
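
For the curious, here is roughly what the encoder setup for such a test looks like in code. This is only a minimal sketch, not the actual test harness: the function name is mine, and it assumes a unified 8/10-bit libx264 that exposes i_bitdepth (newer builds do; at the time of this test, 10-bit x264 was a separate build).

// Minimal sketch: a given preset, four threads, one-pass ABR at 4000 kbit/sec,
// and 10-bit High 10 output. Not the actual test harness used for the post.
#include <stdint.h>
#include <x264.h>

x264_t *open_10bit_encoder(int width, int height, int fps, const char *preset)
{
        x264_param_t param;
        if (x264_param_default_preset(&param, preset, nullptr) < 0) {
                return nullptr;
        }
        param.i_width = width;
        param.i_height = height;
        param.i_fps_num = fps;
        param.i_fps_den = 1;
        param.i_threads = 4;                                // "a typical desktop machine"
        param.i_bitdepth = 10;                              // needs a unified 8/10-bit build
        param.i_csp = X264_CSP_I420 | X264_CSP_HIGH_DEPTH;  // 4:2:0 with more than 8 bits per sample

        param.rc.i_rc_method = X264_RC_ABR;                 // one-pass bitrate targeting
        param.rc.i_bitrate = 4000;                          // kbit/sec

        if (x264_param_apply_profile(&param, "high10") < 0) {  // 10-bit 4:2:0 is High 10
                return nullptr;
        }
        return x264_encoder_open(&param);
}

From there, the encode loop itself (x264_encoder_encode and friends) is exactly the same as for 8-bit; only the input buffers change to 16 bits per sample.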

23 February 2017

Steinar H. Gunderson: Fyrrom recording released

The recording of yesterday's Fyrrom (Samfundet's unofficial take on Boiler Room) is now available on YouTube. Five video inputs, four hours, two DJs, no dropped frames. Good times. Soundcloud coming soon!

21 February 2017

Steinar H. Gunderson: 8-bit Y'CbCr ought to be enough for anyone?

If you take a random computer today, it's pretty much a given that it runs a 24-bit mode (8 bits of each of R, G and B); as we moved away from palettized displays at some point during the 90s, we quickly went past 15- and 16-bit and settled on 24-bit. The reasons are simple; 8 bits per channel is easy to work with on CPUs, and it's on the verge of what human vision can distinguish, at least if you add some dither. As we've been slowly taking the CPU off the pixel path and replacing it with GPUs (which have specialized hardware for more kinds of pixel formats), changing formats has become easier, and there's some push to 10-bit (30-bit) deep color for photo pros, but largely, 8 bits per channel is where we are. Yet, I'm now spending time adding 10-bit input (and eventually also 10-bit output) to Nageru. Why?

The reason is simple: Y'CbCr. Video traditionally isn't done in RGB, but in Y'CbCr; that is, a black-and-white signal (Y) and then two color-difference signals (Cb and Cr, roughly "additional blueness" and "additional redness", respectively). We started doing this because it was convenient in analog TV (if you separate the two, black-and-white TVs can just ignore the color signal), but we kept doing it because it's very nice for reducing bandwidth: Human vision is much less sensitive to color than to brightness, so we can transfer the color channels in lower resolution and get away with it. (Also, a typical Bayer sensor can't deliver full color resolution anyway.) So most cameras and video codecs work in Y'CbCr, not RGB.

Let's look at the implications of using 8-bit Y'CbCr, using a highly simplified model for, well, simplicity. Let's define Y = 1/3 (R + G + B), Cr = R - Y and Cb = B - Y. (The reverse transformation becomes R = Y + Cr, B = Y + Cb and G = 3Y - R - B.) This means that an RGB color such as pure gray ([127, 127, 127]) becomes [127, 0, 0], written as [Y, Cr, Cb]. All is good, and Y can go from 0 to 255, just like R, G and B can. A pure red ([255, 0, 0]) becomes [85, 170, -85], and a pure blue ([0, 0, 255]) becomes correspondingly [85, -85, 170]. As you can see, we also get negative Cr and Cb values; a pure cyan ([0, 255, 255]) becomes [170, -170, 85], for instance. So we need to squeeze values from -170 to +170 into an 8-bit range, losing accuracy. Even worse, there are valid Y'CbCr triplets that don't correspond to meaningful RGB colors at all. For instance, Y'CbCr [255, 170, 0] would be RGB [425, 85, 255]; R is out of range! And Y'CbCr [85, 0, 170] would be RGB [85, -85, 255], that is, negative green. This isn't a problem for compression, as we can just avoid using those illegal colors with no loss of efficiency. But it means that the conversion in itself causes a loss; actually, if you do the maths on the real formulas (using the BT.601 standard), it turns out only 17% of the 24-bit Y'CbCr code words are valid! In other words, we lose about two and a half bits of data, and our 24 bits of accuracy have been reduced to 21.5. Or, to put it another way: 8-bit Y'CbCr is roughly equivalent to 7-bit RGB. (There's a small code sketch of this toy model at the end of the post.)

Thus, pretty much all professional video uses 10-bit Y'CbCr. It's much more annoying to deal with (especially when you've got subsampling!), but if you're using SDI, there's not even any 8-bit version defined, so if you insist on 8-bit, you're taking data you're getting on the wire (whether you want it or not) and throwing 20% of it away. UHDTV standards (using HEVC) are also simply not defined for 8-bit; it's 10- and 12-bit only, even on the codec level.
Part of this is because UHDTV also supports HDR, so you have a wider RGB range than usual to begin with, and 8-bit would cause excessive banding. Using it on the codec level makes a lot of sense for another reason, namely that you reduce internal roundoff errors during processing by a lot; errors equal noise, and noise is bad for compression. I've seen numbers of 15% lower bitrate for H.264 at the same quality, although you also have to take into account that the encoder also needs more CPU power, which you could otherwise have used for a higher preset in 8-bit. I don't know how the tradeoff here works out, and you also have to take into account decoder support for 10-bit, especially when it comes to hardware. (When it comes to HEVC, Intel didn't get full fixed-function 10-bit support before Kaby Lake!)

So indeed, 10-bit Y'CbCr makes sense even for quite normal video. It isn't a no-brainer to turn it on, though: even though Nageru uses a compute shader to convert the 4:2:2 10-bit Y'CbCr to something the GPU can sample from quickly (ie., the CPU doesn't need to touch it), and all internal processing is in 16-bit floating point anyway, it still takes a nonzero amount of time to convert compared to just blasting through 8-bit, so my ultraportable probably won't make it anymore. (A discrete GPU has no issues at all, of course. My laptop converts a 720p frame in about 1.4 ms, FWIW.) But it's worth considering when you want to squeeze even more quality out of the system. And of course, there's still 10-bit output support to be written...
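
Here is the promised sketch of the toy model: a small stand-alone program using exactly the simplified formulas above (not the real BT.601 matrices), showing the negative chroma components and one perfectly representable Y'CbCr triplet whose RGB equivalent is out of range.

// The toy model from above: Y = (R+G+B)/3, Cr = R - Y, Cb = B - Y.
// Triplets are printed as [Y, Cr, Cb], matching the examples in the text.
#include <cstdio>

struct RGB { double r, g, b; };
struct YCbCr { double y, cr, cb; };

YCbCr to_ycbcr(RGB c)
{
        double y = (c.r + c.g + c.b) / 3.0;
        return { y, c.r - y, c.b - y };
}

RGB to_rgb(YCbCr c)
{
        double r = c.y + c.cr;
        double b = c.y + c.cb;
        double g = 3.0 * c.y - r - b;
        return { r, g, b };
}

int main()
{
        // Pure red and pure blue pick up negative chroma components.
        YCbCr red = to_ycbcr({ 255, 0, 0 });   // [85, 170, -85]
        YCbCr blue = to_ycbcr({ 0, 0, 255 });  // [85, -85, 170]
        printf("red:  Y=%.0f Cr=%.0f Cb=%.0f\n", red.y, red.cr, red.cb);
        printf("blue: Y=%.0f Cr=%.0f Cb=%.0f\n", blue.y, blue.cr, blue.cb);

        // A Y'CbCr triplet with no meaningful RGB counterpart.
        RGB bad = to_rgb({ 255, 170, 0 });     // R ends up at 425
        printf("Y'CbCr [255, 170, 0]: R=%.0f G=%.0f B=%.0f\n", bad.r, bad.g, bad.b);
        return 0;
}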

2 February 2017

Steinar H. Gunderson: Not going to FOSDEM but a year of Nageru

It's that time of the year :-) And FOSDEM is fun. But this year I won't be going; there was a scheduling conflict, and I didn't really have anything new to present (although I probably could have shifted around priorities to get something). But FOSDEM 2017 also means there's a year since FOSDEM 2016, where I presented Nageru, my live video mixer. And that's been a pretty busy year, so I thought I'd do a recap from high up above.

First of all, Nageru has actually been used in production: we did Solskogen and Fyrrom, and both gave invaluable input. Then there have been some non-public events, which have also been useful. The Nageru that's in git right now has evolved considerably from the 1.0.0 that was released last year. diffstat shows 19660 insertions and 3543 deletions; that's counting about 2500 lines of vendored headers, though. Even though I like deleting code much more than adding it, the doubling (from ~10k to ~20k lines) represents a significant amount of new features: 1.1.x added support for non-Intel GPUs. 1.2.x added support for DeckLink input cards (through Blackmagic's proprietary drivers), greatly increasing hardware support, and did a bunch of small UI changes. 1.3.x added x264 support that's strong enough that Nageru has really displaced VLC as my go-to tool for just video-signal-to-H.264 conversion (even though it feels like overkill), and also added hotplug support. 1.4.x added multichannel audio support, including support for MIDI controllers, and also a disk space indicator (because when you run out of disk during production without understanding that's what happens, it really sucks), and brought extensive end-user documentation. And 1.5.x, in development right now, will add HDMI/SDI output, which, like all the previous changes, requires various rearchitecting and fixing.

Of course, there are lots of things that haven't changed as well; the basic UI remains the same, including the way the theme (governing the look-and-feel of the finished video stream) works. The basic design has proved sound, and I don't think I would change a lot if I were to design something like 1.0.0 again. As a small free software project, you have to pick your battles, and I'm certainly glad I didn't start out doing something like network support (or a distributed architecture in general, really).

So what's for the next year of Nageru? It's hard to say, and it will definitely depend on the concrete needs of events. A hot candidate (since I might happen to need it) is chroma keying, although good keying is hard to get right and this needs some research. There's also been some discussion around other concrete features, but I won't name them until a firm commitment has been made; priorities can shift around, and it's important to stay flexible. So, enjoy FOSDEM! Perhaps I'll return with a talk in 2018. In the meantime, I'll be preparing the stream for the 2017 edition of Fyrrom, and I know for sure there will be more events, more features and more experiences to be had. And, inevitably, more bugs. :-)

22 January 2017

Steinar H. Gunderson: Nageru loopback test

Nageru, my live video mixer, is in the process of gaining HDMI/SDI output for bigscreen use, and in that process, I ran some loopback tests (connecting the output of one card into the input of another) to verify that I had all the colorspace parameters right. (This is of course trivial if you are only passing one input through bit by bit, but Nageru is much more flexible, so it really needs to understand what each pixel means.) It turns out that if you mess up any of these parameters ever so slightly, you end up with something like this, this or this. But thankfully, I got this instead on the very first try, so it really seems it's been right all along. :-) (There's a minor first-generation loss in that the SDI chain is 8-bit Y'CbCr instead of 10-bit, but I really can't spot it with the naked eye, and it doesn't compound through generations. I plan to fix that for those with spare GPU power at some point, possibly before the 1.5.0 release.)
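
The post doesn't spell out which parameters those are, but for Y'CbCr video they are things like the ones below. This is just an illustrative sketch (not Nageru's actual types); the constants in the comments come from the BT.601/BT.709 specs, and the example value at the end is merely a plausible choice for an HD SDI chain, not a statement about what Nageru actually uses.

// Illustration only: the per-signal parameters that have to agree end-to-end
// for a Y'CbCr loopback to come back looking right.
enum class YCbCrCoefficients { kRec601, kRec709 };  // Kr/Kb: 0.299/0.114 vs 0.2126/0.0722
enum class Range { kLimited, kFull };               // 8-bit limited range: Y' 16..235, CbCr 16..240
enum class ChromaSiting { kLeft, kCenter };         // where the subsampled chroma samples sit

struct YCbCrFormat {
        YCbCrCoefficients coefficients;
        Range range;
        ChromaSiting siting;
        int bits_per_sample;  // 8 on this particular SDI chain, 10 eventually
};

// A plausible (hypothetical) set for an HD SDI chain: Rec. 709, limited range,
// left-sited chroma, 8 bits per sample.
constexpr YCbCrFormat kExampleSdiFormat = {
        YCbCrCoefficients::kRec709, Range::kLimited, ChromaSiting::kLeft, 8
};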

11 January 2017

Steinar H. Gunderson: 3G-SDI signal support

I had to figure out what kinds of signals you can run over 3G-SDI today, and it's pretty confusing, so I thought I'd share it. For reference, 3G-SDI is the same as 3G HD-SDI, an extension of HD-SDI, which is an extension of the venerable SDI standard (well, duh). They're all used for running uncompressed audio/video data over regular BNC coaxial cable, possibly hundreds of meters, and are in wide use in professional and semiprofessional setups. So here's the rundown on 3G-SDI capabilities: And then there's dual-link 3G-SDI, which uses two cables instead of one, and there's also Blackmagic's proprietary 6G-SDI, which supports basically everything dual-link 3G-SDI does. But in 2015, seemingly there was also a real 6G-SDI and 12G-SDI, and it's unclear to me whether they're in any way compatible with Blackmagic's offering. It's all confusing. But at least, these are the differences from single-link to dual-link 3G-SDI: 4K? I don't know. 120fps? I believe that's also a proprietary extension of some sort. And of course, having a device support 3G-SDI doesn't mean at all it's required to support all of this; in particular, I believe Blackmagic's systems don't support alpha at all except on their single 12G-SDI card, and I'd also not be surprised if RGB support is rather limited in practice.

8 January 2017

Steinar H. Gunderson: SpeedHQ decoder

I reverse-engineered a video codec. (And then the CTO of the company making it became really enthusiastic, and offered help. Life is strange sometimes.) I'd have talked about this and some related stuff at FOSDEM, but there's a scheduling conflict, so I will be elsewhere that weekend, not in Brussels.

25 December 2016

Steinar H. Gunderson: Cracking a DataEase password

I recently needed to get access to a DataEase database; the person I helped was the legitimate owner of the data, but had forgotten the password, as the database was largely from 1996. There are various companies around the world that seem to do this, or something similar (like giving you an API), for a usually unspecified fee; they all have very 90s homepages and in general seem like they went out of business a long time ago. And I wasn't prepared to wait.

For those of you who don't know DataEase, it's a sort-of relational database for DOS that had its heyday in the late 80s and early 90s (being sort of the cheap cousin of dBase); this is before SQL gained traction as the standard query language, before real multiuser database access, and before variable-width field storage. It is also before reasonable encryption. Let's see what we can do.

DataEase has a system where tables are mapped through the data dictionary, which is a table on its own. (Sidenote: MySQL pre-8.0 still does not have this.) This is the file RDRRTAAA.DBM; I don't really know what RDRR stands for, but T is the database letter in case you wanted more than one database in the same directory, and AAA, AAB, AAC etc. is a counter in case a table grows too big for one file. (There's also .DBA files for the structure of non-system tables, and then some extra stuff for indexes.) DBM files are pretty much the classical, fixed-length 80s-style database files; each row has some flags (I believe these are for e.g. "row is deleted") and then just the rows in fixed format right after each other. For instance, here's one I created as part of testing (just the first few lines of the hexdump are shown):
00000000: 0e 00 01 74 65 73 74 62 61 73 65 00 00 00 00 00  ...testbase.....
00000010: 00 00 00 00 00 00 00 73 46 cc 29 37 00 09 00 00  .......sF.)7....
00000020: 00 00 00 00 00 43 3a 52 44 52 52 54 41 41 41 2e  .....C:RDRRTAAA.
00000030: 44 42 4d 00 00 01 00 0e 00 52 45 50 4f 52 54 20  DBM......REPORT 
00000040: 44 49 52 45 43 54 4f 52 59 00 00 00 00 00 1c bd  DIRECTORY.......
00000050: d4 1a 27 00 00 00 00 00 00 00 00 00 43 3a 52 45  ..'.........C:RE
00000060: 50 4f 54 41 41 41 2e 44 42 4d 00 00 01 00 0e 00  POTAAA.DBM......
00000070: 52 65 6c 61 74 69 6f 6e 73 68 69 70 73 00 00 00  Relationships...
Even without going in-depth, we can see the structure here; there's testbase, which maps to C:RDRRTAAA.DBM (the RDRR itself), there's a table called REPORT DIRECTORY that maps to C:REPOTAAA.DBM, and then more stuff after that, and so on. However, other tables are not so easily read, because you can ask DataEase to encrypt a table. Let's look at such an encrypted table, like the Users table (containing usernames, passwords (not password hashes!) and some extra information like access level), which is always encrypted:
00000000: 0c 01 9f ed 94 f7 ed 34 ba 88 9f 78 21 92 7b 34  .......4...x!.{4
00000010: ba 88 0f d9 94 05 1e 34 ba 88 a0 78 21 92 7b 34  .......4...x!.{4
00000020: e2 88 9f 78 21 92 7b 34 ba 88 9f 78 21 92 7b 34  ...x!.{4...x!.{4
00000030: ba 88 9f 78 21 92 7b 34 ba 88 9f 78 21 92 7b     ...x!.{4...x!.{
Clearly, this isn't very good encryption; it uses a very short, repetitive key of eight bytes (64 bits). (The data is mostly zero padding, which makes it much easier to spot this.) In fact, in actual data tables, only five of these bytes are set to a non-zero value, which means we have a 40-bit key; export controls? My first assumption here was of course XOR, but through some experimentation, it turned out what you need is actually 8-bit subtraction (with wraparound). The key used is derived from both a database key and a per-table key, both stored in the RDRR; again, if you disassemble, I'm sure you can find the key derivation function, but that's annoying, too. Note, by the way, that this precludes making an attack by just copying tables between databases, since the database key is different. So let's do a plaintext attack. If you assume the plaintext of the bottom row is all padding, that's your key, and here's what you end up with:
00000000: 52 79 00 75 73 65 72 00 00 00 00 00 00 00 00 00  Ry.user.........
00000010: 00 00 70 61 73 73 a3 00 00 00 01 00 00 00 00 00  ..pass..........
00000020: 28 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  (...............
00000030: 00 00 00 00 00 00 00 00                          ........ 
Not bad, eh? Actually the first byte of the key here is wrong as far as I know, but it didn't interfere with the fields, so we have what we need to log in. (At that point, we've won, because DataEase will helpfully decrypt everything transparently for us.) However, there's a twist; if the password is longer than four characters, the entire decryption of the Users table changes. Of course, we could run our plaintext attack against every data table and pick out the information by decoding the structure, but again: annoying. So let's see what it looks like if we choose "passs" instead:
00000000: 0e 01 9f 7a ae 9e 21 f5 08 63 07 6d a3 a1 17 5d  ...z..!..c.m...]
00000010: 70 cb df 36 7e 7c 91 c5 d8 33 d8 3d 73 71 e7 2d  p..6~|...3.=sq.-
00000020: 7b 9b 3f a5 db d9 4f 95 a8 03 a7 0d 43 41 b7 fd  {.?...O.....CA..
00000030: 10 6b 0f 75 ab a9 1f 65 78 d3 77 dd 13 11 87     .k.u...ex.w....
Distinctly more confusing. At this point, of course, we know at which byte positions the username and password start, so if we wanted to, we could just try setting the start byte of the password to every possible byte in turn until we hit 0x00 (DataEase truncates fields at the first zero byte), which would allow us to get in with an empty password. However, I didn't know the username either, and trying two bytes would mean 65536 tries, and I wasn't up for automating macros through DOSBox. So an active attack wasn't too tempting. However, we can look at the last hex byte (where we know the plaintext is 0); it goes 0x5d, 0x2d, 0xfd... and some other bytes go 0x08, 0xd8, 0xa8, 0x78, and so on. So clearly there's an obfuscation here where we have a per-line offset that decreases by 0x30 per line. (Actually, the increase/decrease per line seems to be derived from the key somehow, too.) If we remove that, we end up with:
00000000: 0e 01 9f 7a ae 9e 21 f5 08 63 07 6d a3 a1 17 5d  ...z..!..c.m...]
00000010: a0 fb 0f 66 ae ac c1 f5 08 63 08 6d a3 a1 17 5d  ...f.....c.m...]
00000020: db fb 9f 05 3b 39 af f5 08 63 07 6d a3 a1 17 5d  ....;9...c.m...]
00000030: a0 fb 9f 05 3b 39 af f5 08 63 07 6d a3 a1 17     ....;9...c.m...
Well, OK, this wasn't much more complicated; our fixed key is now 16 bytes long instead of 8 bytes long, but apart from that, we can do exactly the same plaintext attack. (Also, it seems to change per-record now, but we don't see it here, since we've only added one user.) Again, assume the last line is supposed to be all 0x00 and thus use that as a key (plus the last byte from the previous line), and we get:
00000000: 6e 06 00 75 73 65 72 00 00 00 00 00 00 00 00 00  n..user.........
00000010: 00 00 70 61 73 73 12 00 00 00 01 00 00 00 00 00  ..pass..........
00000020: 3b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ;...............
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00     ...............
Well, OK, it wasn't perfect; we got "pass\x12" instead of "passs", so we messed up somehow. I don't know exactly why the fifth character gets messed up like this; actually, it cost me half an hour of trying, because the password looked very real but the database wouldn't let me in, but eventually, we just guessed at what the missing letter was supposed to be. So there you have it; practical small-scale cryptanalysis of DOS-era homegrown encryption. Nothing advanced, but the user was happy about getting the data back after a few hours of work. :-)
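
For the curious, the whole attack fits in a few lines of code. This is a minimal sketch, not the tool actually used; the 16-byte line width, the key lengths and the 0x30 offset step are simply what the dumps above show, and would need adjusting for other tables.

#include <cstdint>
#include <vector>

// Undo the per-line obfuscation seen in the second dump: the offset decreases
// by 'step' (0x30 there) for every 16-byte line, so we add it back.
std::vector<uint8_t> remove_line_offset(std::vector<uint8_t> data, uint8_t step)
{
        for (size_t i = 0; i < data.size(); ++i) {
                data[i] = uint8_t(data[i] + step * (i / 16));
        }
        return data;
}

// The plaintext attack itself: if the bytes at [key_start, key_start+keylen)
// are known to decrypt to zero padding, those ciphertext bytes *are* the
// repeating key, since plaintext = ciphertext - key (mod 256). Assumes
// key_start is a multiple of keylen, so the key phase lines up.
std::vector<uint8_t> decrypt(const std::vector<uint8_t> &data, size_t key_start, size_t keylen)
{
        std::vector<uint8_t> plain(data.size());
        for (size_t i = 0; i < data.size(); ++i) {
                plain[i] = uint8_t(data[i] - data[key_start + (i % keylen)]);
        }
        return plain;
}

For the first Users dump, that would be decrypt(data, 0x30, 8); for the second, decrypt(remove_line_offset(data, 0x30), 0x30, 16), with the key taken from the all-padding bottom row in both cases.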

20 November 2016

Steinar H. Gunderson: Nageru documentation

Even though the World Chess Championship takes up a lot of time these days, I've still found some time for Nageru, my live video mixer. But this time it doesn't come in the form of code; rather, I've spent my time writing documentation. I spent some time fretting over what technical solution I wanted. I explicitly wanted end-user documentation, not developer documentation; I rarely find HTML-rendered versions of every member function in a class the best way to understand a program anyway. Actually, on the contrary: Having all sorts of syntax interwoven in class comments tends to be more distracting than anything else. Eventually I settled on Sphinx, not because I found it fantastic (in particular, ReST is a pain with its bizarre, variable punctuation-based syntax), but because I'm convinced it has all the momentum right now. Just like git did back in the day, the fact that the Linux kernel has chosen it means it will inevitably grow a quite large ecosystem, and I won't be ending up having to maintain it anytime soon. I tried finding a balance between spending time on installation/setup (only really useful for first-time users, and even then, only a subset of them), concept documentation (how to deal with live video in general, and how Nageru fits into a larger ecosystem of software and equipment), and more concrete documentation of all the various features and quirks of Nageru itself. Hopefully, most people will find at least something that's not already obvious to them, without drowning in detail. You can read the documentation at https://nageru.sesse.net/doc/, or if you want to send patches, the right place to patch is the git repository.

5 November 2016

Steinar H. Gunderson: Multithreaded OpenGL

Multithreading continues to be hard (although the alternatives are not really a lot better). While debugging a user issue in Nageru, I found and fixed a few races (mostly harmless in practice, though) in my own code, but also two issues that I filed patches for in Mesa. But that's not enough, it seems; there are still issues that are too subtle for me to figure out on-the-fly. But at least with those patches, I can use interlaced video sources in Nageru on Intel GPUs without segfaulting pretty much immediately. My laptop's GPU isn't fast enough to actually run the YADIF deinterlacer in realtime at 1080p60, though, but it's nice to at least not take the program down. (These things are super-sensitive to timing, of course, which is probably why I didn't see them when developing the feature a year or so ago.) As usual, NVIDIA's proprietary drivers seem to be near-flawless in this regard. I'm starting to think maybe it's about massive amounts of QA resources.

26 October 2016

Steinar H. Gunderson: Why does software development take so long?

Nageru 1.4.0 is out (and on its way through the Debian upload process right now), so now you can do live video mixing with multichannel audio to your heart's content. I've already blogged about most of the interesting new features, so instead, I'm trying to answer a question: What took so long? To be clear, I'm not saying 1.4.0 took more time than I really anticipated (on the contrary, I pretty much understood the scope from the beginning, and there was a reason why I didn't go for building this stuff into 1.0.0); but if you just look at the changelog from the outside, it's not immediately obvious why multichannel audio support should take the better part of three months of development. What I'm going to say is of course going to be obvious to most software developers, but not everyone is one, and perhaps my experiences will be illuminating.

Let's first look at some obvious things that aren't the case: First of all, development is not primarily limited by typing speed. There are about 9,000 lines of new code in 1.4.0 (depending a bit on how you count), and if it were just about typing them in, I would be done in a day or two. On a good keyboard, I can type plain text at more than 800 characters per minute, but you hardly ever write code for even a single minute at that speed. Just as when writing a novel, most time is spent thinking, not typing. I also didn't spend a lot of time backtracking; most code I wrote actually ended up in the finished product, as opposed to being thrown away. (I'm not as lucky in all of my projects.) It's pretty common to do so if you're in an exploratory phase, but in this case, I had a pretty good idea of what I wanted to do right from the start, and that plan seemed to work. This wasn't a difficult project per se; it just needed to be done (which, in a sense, just increases the mystery).

However, even if this isn't at the forefront of science in any way (most code in the world is pretty pedestrian, after all), there's still a lot of decisions to make, on several levels of abstraction. And a lot of those decisions depend on information gathering beforehand. Let's take a look at an example from late in the development cycle, namely support for using MIDI controllers instead of the mouse to control the various widgets. I've kept a pretty meticulous TODO list; it's just a text file on my laptop, but it serves the purpose of a ghetto bugtracker. For 1.4.0, it contains 83 work items (a single-digit number of them is not ticked off, mostly because I decided not to do those things), which corresponds roughly 1:2 to the number of commits. So let's have a look at what went into the ~20 MIDI controller items.

First of all, to allow MIDI controllers to influence the UI, we need a way of getting to it. Since Nageru is single-platform on Linux, ALSA is the obvious choice (if not, I'd probably have to look for a library to put in-between), but seemingly, ALSA has two interfaces (raw MIDI and sequencer). Which one do you want? It sounds like raw MIDI is what we want, but actually, it's the sequencer interface (it does more of the MIDI parsing for you, and generally is friendlier). The first question is where to start picking events from. I went the simplest path and just said I wanted all events; anything else would necessitate a UI, a command-line flag, figuring out if we wanted to distinguish between different devices with the same name (and not all devices potentially even have names), and so on. But how do you enumerate devices? (Relatively simple, thankfully.)
What do you do if the user inserts a new one while Nageru is running? (Turns out there's a special device you can subscribe to that will tell you about new devices.) What if you get an error on subscription? (Just print a warning and ignore it; it's legitimate not to have access to all devices on the system. By the way, for PCM devices, all of these answers are different.) So now we have a sequencer device; how do we get events from it? Can we do it in the main loop? Turns out it probably doesn't integrate too well with Qt, but it's easy enough to put it in a thread. The class dealing with the MIDI handling now needs locking; what mutex granularity do we want? (Experience will tell you that you nearly always just want one mutex. Two mutexes give you all sorts of headaches with ordering them, and nearly never give any gain.) ALSA expects us to poll() a given set of descriptors for data, but on shutdown, how do you break out of that poll to tell the thread to go away? (The simplest way on Linux is using an eventfd.) There's a quirk where if you get two or more MIDI messages right after each other and only read one, poll() won't trigger to alert you there are more left. Did you know that? (I didn't. I also can't find it documented. Perhaps it's a bug?) It took me some looking into sample code to find it. Oh, and ALSA uses POSIX error codes to signal errors (like "nothing more is available"), but it doesn't use errno. (There's a small sketch of this event loop at the end of this post.)

OK, so you have events (like "controller 3 was set to value 47"); what do you do about them? The meaning of the controller numbers is different from device to device, and there's no open format for describing them. So I had to make a format describing the mapping; I used protobuf (I have lots of experience with it) to make a simple text-based format, but it's obviously a nightmare to set up 50+ controllers by hand in a text file, so I had to make a UI for this. My initial thought was making a grid of spinners (similar to how the input mapping dialog already worked), but then I realized that there isn't an easy way to make headings in Qt's grid. (You can substitute a label widget for a single cell, but not for an entire row. Who knew?) So after some searching, I found out that it would be better to have a tree view (Qt Creator does this), and then you can treat that more-or-less as a table for the rows that should be editable. Of course, guessing controller numbers is impossible even in an editor, so I wanted it to respond to MIDI events. This means the editor needs to take over the role as MIDI receiver from the main UI. How do you do that in a thread-safe way? (Reuse the existing mutex; you don't generally want to use atomics for complicated things.) Thinking about it, shouldn't the MIDI mapper just support multiple receivers at a time? (Doubtful; you don't want your random controller fiddling during setup to actually influence the audio on a running stream. And would you use the old or the new mapping?) And do you really need to set up every single controller for each bus, given that the mapping is pretty much guaranteed to be similar for them? Making a "guess bus" button doesn't seem too difficult, where if you have one correctly set up controller on the bus, it can guess from a neighboring bus (assuming a static offset). But what if there's conflicting information? OK; then you should disable the button. So now the enable/disable status of that button depends on which cell in your grid has the focus; how do you get at those events?
(Install an event filter, or subclass the spinner.) And so on, and so on, and so on.

You could argue that most of these questions go away with experience; if you're an expert in a given API, you can answer most of these questions in a minute or two even if you haven't heard the exact question before. But you can't expect even experienced developers to be an expert in all possible libraries; if you know everything there is to know about Qt, ALSA, x264, ffmpeg, OpenGL, VA-API, libusb, microhttpd and Lua (in addition to C++11, of course), I'm sure you'd be a great fit for Nageru, but I'd wager that pretty few developers fit that bill. I've written C++ for almost 20 years now (almost ten of them professionally), and that experience certainly helps boost productivity, but I can't say I expect a 10x reduction in my own development time at any point.

You could also argue, of course, that spending so much time on the editor is wasted, since most users will only ever see it once. But here's the point; it's not actually a lot of time. The only reason why it seems like so much is that I bothered to write two paragraphs about it; it's not a particular pain point, it just adds to the total. Also, the first impression matters a lot: if the user can't get the editor to work, they also can't get the MIDI controller to work, and are likely to just go do something else.

A common misconception is that just switching languages or using libraries will help you a lot. (Witness the never-ending stream of software that advertises "written in Foo" or "uses Bar" as if it were a feature.) For the former, note that nothing I've said so far is specific to my choice of language (C++), and I've certainly avoided a bunch of battles by making that specific choice over, say, Python. For the latter, note that most of these problems are actually related to library use; libraries are great, and they solve a bunch of problems I'm really glad I didn't have to worry about (how should each button look?), but they still give their own interaction problems. And even when you're a master of your chosen programming environment, things still take time, because you have all those decisions to make on top of your libraries. Of course, there are cases where libraries really solve your entire problem and your code gets reduced to 100 trivial lines, but that's really only when you're solving a problem that's been solved a million times before. Congrats on making that blog in Rails; I'm sure you're advancing the world. (To make things worse, usually this breaks down when you want to stray ever so slightly from what was intended by the library or framework author. What seems like a perfect match can suddenly become a development trap where you spend more of your time trying to become an expert in working around the given library than actually doing any development.)

The entire thing reminds me of the famous essay No Silver Bullet by Fred Brooks, but perhaps even more so, this quote from John Carmack's .plan has stuck with me (incidentally about mobile game development in 2006, but the basic story still rings true):
To some degree this is already the case on high end BREW phones today. I have a pretty clear idea what a maxed out software renderer would look like for that class of phones, and it wouldn't be the PlayStation-esq 3D graphics that seems to be the standard direction. When I was doing the graphics engine upgrades for BREW, I started along those lines, but after putting in a couple days at it I realized that I just couldn't afford to spend the time to finish the work. "A clear vision" doesn't mean I can necessarily implement it in a very small integral number of days.
In a sense, programming is all about what your program should do in the first place. The "how" question is just the "what", moved down the chain of abstractions until it ends up where a computer can understand it, and at that point, the three words "multichannel audio support" have become those 9,000 lines that describe in perfect detail what's going on.
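
To give a flavor of the ALSA part of this, here is what the skeleton of such an event loop can look like. This is not Nageru's actual code, just a sketch: it assumes the sequencer handle has already been opened (snd_seq_open with SND_SEQ_OPEN_INPUT) and the relevant ports subscribed, and that shutdown_fd is an eventfd that another thread writes to when it's time to quit.

// Sketch only: poll() on the sequencer's descriptors plus an eventfd for
// shutdown, draining all pending events on each wakeup (because of the
// quirk mentioned above: poll() won't re-trigger for events that are
// already sitting in the input buffer).
#include <alsa/asoundlib.h>
#include <poll.h>
#include <vector>

void midi_thread(snd_seq_t *seq, int shutdown_fd)
{
        int num_fds = snd_seq_poll_descriptors_count(seq, POLLIN);
        std::vector<pollfd> fds(num_fds + 1);
        snd_seq_poll_descriptors(seq, fds.data(), num_fds, POLLIN);
        fds[num_fds] = { shutdown_fd, POLLIN, 0 };

        for ( ;; ) {
                if (poll(fds.data(), fds.size(), -1) <= 0) {
                        continue;  // signal or error; a real implementation would check why
                }
                if (fds[num_fds].revents & POLLIN) {
                        return;  // someone wrote to the eventfd; time to go away
                }
                // Drain *all* pending events, not just one.
                while (snd_seq_event_input_pending(seq, /*fetch_sequencer=*/1) > 0) {
                        snd_seq_event_t *ev;
                        if (snd_seq_event_input(seq, &ev) < 0) {
                                break;  // negative POSIX error code, not errno
                        }
                        if (ev->type == SND_SEQ_EVENT_CONTROLLER) {
                                // ev->data.control.param is the controller number,
                                // ev->data.control.value its new value; hand it to the mapper.
                        }
                }
        }
}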

16 October 2016

Steinar H. Gunderson: backup.sh opensourced

It's been said that backup is a bit like flossing; everybody knows you should do it, but nobody does it. If you want to start flossing, an immediate question is what kind of dental floss to get; and conversely, for backup, which backup software you want to rely on. I had some criteria: I looked at basically everything that existed in Debian and then some, and all of them failed. But Samfundet had its own script that's basically just a simple wrapper around tar and ssh, which has worked for 15+ years without a hitch (including several restores), so why not use it? All the authors agreed to GPLv2+ licensing, so now it's time for backup.sh to meet the world. It does about the simplest thing you can imagine: ssh to the server and use GNU tar to tar down every filesystem that has the dump bit set in fstab. Every 30 days, it does a full backup; otherwise, it does an incremental backup using GNU tar's incremental mode (which makes sure you will also get information about file deletes). It doesn't do inter-file diffs (so if you have huge files that change only a little bit every day, you'll get blowup), and you can't do single-file restores without basically scanning through all the files; tar isn't random-access. So it doesn't do much fancy, but it works, and it sends you a nice little email every day so you can know your backup went well. (There's also a less frequently used mode where the backed-up server encrypts the backup using GnuPG, so you don't even need to trust the backup server.) It really takes fifteen minutes to set up, so now there's no excuse. :-) Oh, and the only good dental floss is this one. :-)

2 October 2016

Steinar H. Gunderson: SNMP MIB setup

If you just install the snmp package out of the box, you won't get the MIBs, so it's pretty much useless for anything vendor-specific without some setup. I'm sure this is documented somewhere, but I have to figure it out afresh every single time, so this time I'm writing it down; I can't possibly be the only one getting confused. First, install snmp-mibs-downloader from non-free. You'll need to work around bug #839574 to get the Cisco MIBs right:
# cp /usr/share/doc/snmp-mibs-downloader/examples/cisco.conf /etc/snmp-mibs-downloader/
# gzip -cd /usr/share/doc/snmp-mibs-downloader/examples/ciscolist.gz > /etc/snmp-mibs-downloader/ciscolist
Now you can download the Cisco MIBs:
# download-mibs cisco
However, this only downloads them; you will need to modify snmp.conf to actually use them. Comment out the line that says "mibs :", and then add:
mibdirs +/var/lib/snmp/mibs/cisco/
Voila! Now you can use snmpwalk with e.g. -m AIRESPACE-WIRELESS-MIB to get the full range of Cisco WLC objects (and the first time you do so as root or the Debian-snmp user, the MIBs will be indexed in /var/lib/snmp/mib_indexes/.)

25 September 2016

Steinar H. Gunderson: Nageru @ Fyrrom

When Samfundet wanted to make their own Boiler Room spinoff (called Fyrrom, more or less a direct translation), it was a great opportunity to try out the new multitrack code in Nageru. After all, what can go wrong with a pretty much untested and unfinished git branch, right? So we cobbled together a bunch of random equipment from here and there [photo: video equipment], hooked it up to Nageru [screenshot], and together with some great work from the people actually pulling together the event, this was the result. Lots of fun. And yes, some bugs were discovered; of course, field testing without followup patches is meaningless (that would either mean you're not actually taking your test experience into account, or that your testing gave no actionable feedback and thus was useless), so they will be fixed in due time for the 1.4.0 release. Edit: Fixed a screenshot link.

16 September 2016

Steinar H. Gunderson: BBR opensourced

This is pretty big stuff for anyone who cares about TCP. Huge congrats to the team at Google.

4 September 2016

Steinar H. Gunderson: Multitrack audio in Nageru 1.4.0

Even though the Chess Olympiad takes some attention right now, development on Nageru (my live video mixer) has continued steadily since the 1.3.0 release. I wanted to take a little time to talk about the upcoming 1.4.0 release, and why things are as they are; writing things down often makes them a bit clearer.

Every major release of Nageru has had a specific primary focus: 1.0.0 was about just getting everything to work, 1.1.0 was about NVIDIA support for more oomph, 1.2.0 was about stabilization and polish (and added support for Blackmagic's PCI cards as a nice little bonus), and 1.3.0 was about x264 integration. For 1.4.0, I wanted to work on multitrack audio and mixing. Audio has always been a clear focus for Nageru, and for good reason; video is 90% about audio, and it's sorely neglected in most amateur productions (not to mention that processing tools are nearly non-existent in most free or cheap software video mixers). Right from the get-go, it's had a chain with proper leveling, compressors and most importantly visual monitoring, so that you know when things are not as they should be. However, it was also written with an assumption that there would be a single audio input source (one of the cameras), and that's going to change.

Single-input is less of a showstopper than one would think at first; you can work around it by buying a mixer, plugging everything into that and then feeding that signal into the computer. However, there are a few downsides: If you want camera audio, you'll need to pull more cable from each camera (or have SDI de-embedders). Your mixer is likely to require an extra power supply, and that means yet more cable (any decent USB video card can be powered over USB, so why shouldn't your audio?). You'll need to buy and transport yet another device. And so on. (If you already have a PA mixer, of course you can use it, but just reusing the PA mix as a stream mix rarely gives the best results, and mixing on an aux bus gives very little flexibility.)

So for 1.4.0, I wanted to get essentially the processing equivalent of a mid-range mixer. But even though my education is in DSP, my experience with mixers is rather limited, so I did the only reasonable thing and went over to a friend who's also an (award-winning) audio engineer. (It turns out that everything on a mixer is the way it is for a pretty good reason, tuned through 50+ years of collective audio experience. If you just try to make up something on your own without understanding what's going on, you have a 0.001% chance of stumbling upon some genius new way of doing things by accident, and a significantly larger chance than that of messing things up.) After some back and forth, we figured out a reasonable set of basic tools that would be useful in the right hands, and not too confusing for a beginner.

So let's have a look at the new controls you get. [Screenshot: Nageru's expanded audio control view.] There's one set of these controls for each bus. (This is the expanded view; there's also a compact view that has only the meters and the fader, which is what you'll typically want to use during the run itself; the expanded view is for setup and tweaking.) A bus in Nageru is a pair of channels (left/right), sourced from a video capture or ALSA card. The channel mapping is flexible; my USB sound card has 18 channels, for instance, and you can use that to make several buses.
Each bus has a name (here I named it very creatively "Main", but in a real setting you might want something like "Blue microphone" or "Speaker PC"), which is just for convenience; it doesn't mean much. The most important parts of the mix are given the most screen real estate, so even though the way through the signal chain is left-to-right, top-to-bottom, I'll go over it in the opposite direction.

By far the most important part is the audio level, so the fader naturally is very prominent. (Note that the scale is nonlinear; you want more resolution in the most important area.) Changing a fader with the mouse or keyboard is possible, and probably most people will be doing that, but Nageru will also support USB faders. These usually speak MIDI, for historical reasons, and there are some UI challenges when they're all so different, but you can get really small ones if you want to get that tactile feel without blowing up your budget or getting a bigger backpack.

Then there's the meter to the left of that. Nageru already has R128 level meters in the mastering section (not shown here, but generally unchanged from 1.3.0), and those are kept as-is, but for each bus, you don't want to know loudness; you want to know recording levels, so you want a peak meter, not a loudness meter. (There's a small sketch of the difference at the end of this post.) In particular, you don't want the bus to send clipped data to the master (which would happen if you set it too high); Nageru can handle this situation pretty well (unlike most digital mixers, it mixes in full 32-bit floating point, so there's no internal clipping, and there's a limiter on the master by default), but it's still not a good place to be in, so you can see that being marked in red in this example. The meter doubles as an input peak check during setup; if you turn off all the effects and set the fader to neutral, you can see if the input hits peak or not, and then adjust it down. (Also, you can see here that I only have audio in the left channel; I'd better check my connections, or perhaps just use mono, by setting the right channel on the bus mapping to the same input as the left one.)

The compressor (now moved from the mastering section to each bus) should be well-known for those using 1.3.0, but in this view, it also has a reduction meter, so that you can see whether it kicks in or not. Most casual users would want to just leave the gain staging and compressor settings alone, but a skilled audio engineer will know how to adjust these to each speaker's antics (some speak at a pretty even volume and thus can get a bit of headroom, while some are much more variable and need tighter settings).

Finally (or, well, first), there's the EQ section. The lo-cut is again well-known from 1.3.0 (the cutoff frequency is the same across all buses), but there's now also a simple three-band EQ per bus. Simply ask the speaker to talk normally for a bit, and tweak the controls until it sounds good. People have different voices and different ways of holding the microphone, and if you have a reasonable ear, you can use the EQ to your advantage to make them sound a little more even on the stream. Either that, or just put it in neutral, and the entire EQ code will be bypassed.

The code is making pretty good progress; all the DSP stuff is done (save for some optimizations I want to do in zita-resampler, now that the discussion upstream has started flowing again), and in theory, one could use it already as-is.
However, there's a fair amount of gnarly support code that still needs to be written: In particular, I need to do some refactoring to support ALSA hotplug (you don't want your entire stream to go down just because a USB soundcard dropped out for a split second), and similarly some serialization for saving/loading bus mappings. It's not exactly rocket science, but all the code still needs to be written, and there are a number of corner cases to think of. If you want to peek, the code is in the multichannel_audio branch, but beware; I rebase/squash it pretty frequently, so if you pull from it, expect frequent git breakage.
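
As promised above, here is what the per-bus peak metering boils down to conceptually. This is a toy sketch, not Nageru's code; real R128 loudness metering, by contrast, involves prefiltering and gating and is a much bigger beast.

// Toy sketch of a peak meter: track the largest absolute sample value in a
// buffer of float samples and report it in dBFS (0.0 dBFS = full scale;
// anything above that will clip once converted back to fixed point).
#include <algorithm>
#include <cmath>
#include <cstddef>

float peak_dbfs(const float *samples, size_t num_samples)
{
        float peak = 0.0f;
        for (size_t i = 0; i < num_samples; ++i) {
                peak = std::max(peak, std::abs(samples[i]));
        }
        if (peak == 0.0f) {
                return -HUGE_VALF;  // silence; display as -inf dBFS
        }
        return 20.0f * std::log10(peak);
}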

14 August 2016

Steinar H. Gunderson: Linear interpolation, source alignment, and Debian's embedding policy

At some point back when the dinosaurs roamed the Earth and I was in high school, I borrowed my first digital signal processing book from a friend. I later went on to an engineering education and a master's thesis about DSP, but the very basics of DSP never cease to fascinate me. Today, I wanted to write something about one of them and how it affects audio processing in Nageru (and finally, how Debian's policies put me in a bit of a bind on this issue). DSP texts tend to obscure profound truths with large amounts of maths, so I'll try to present a somewhat less general result that doesn't require going into the mathematical details. That rule is: Adding a signal to weighted, delayed copies of itself is a filtering operation. (It's simple, but ignoring it will have sinister effects, as we'll see later.)

Let's see exactly what that means with a motivating example. Let's say that I have a signal where I want to get rid of (or rather, reduce) high frequencies. The simplest way I can think of is to add every neighboring sample; that is, set y[n] = x[n] + x[n-1]. For each sample, we add the previous sample, ie., the signal as it was one sample ago. (We ignore what happens at the edges; the common convention is to assume signals extend out to infinity with zeros.) What effect will this have? We can figure it out with some trigonometry, but let's just demonstrate it by plotting instead: We assume a 48 kHz sample rate (which means that our one-sample delay is 20.83 µs) and a 22 kHz note (definitely treble!), and plot the signal with one-sample delay (the x axis is sample number). [Plot: filtered 22 kHz signal.]

As you can see, the resulting signal is a new signal of the same frequency (which is always true; linear filtering can never create new frequencies, just boost or dampen existing ones), but with much lower amplitude. The signal and the delayed version of it end up mostly cancelling each other out. Also note that the signal has changed phase; the resulting signal has been a bit delayed compared to the original. Now let's look at a 50 Hz signal (turn on your bass face). We need to zoom out a bit to see full 50 Hz cycles. [Plot: filtered 50 Hz signal.] The original signal and the delayed one overlap almost exactly! For a lower frequency, the one-sample delay means almost nothing (since the waveform is varying so slowly), and thus, in this case, the resulting signal is amplified, not dampened. (The signal has changed phase here, too, actually exactly as much in terms of real time, but we don't really see it, because we've zoomed out.)

Real signals are not pure sines, but they can be seen as sums of many sines (another fundamental DSP result), and since filtering is a linear operation, it affects those sines independently. In other words, we now have a very simple filter that will amplify low frequencies and dampen high frequencies (and delay the entire signal a little bit). We can do this for all frequencies from 0 to 24000 Hz; let's ask Octave to do it for us. [Plot: frequency response of the simple filter.] (Of course, in a real filter, we'd probably multiply the result by 0.5 to leave the bass untouched instead of boosting it, but it doesn't really change anything. A real filter would have a lot more coefficients, though, and they wouldn't all be the same!)

Let's now turn to a problem that will at first seem different: combining audio from multiple different time sources.
For instance, when mixing video, you could have input from two different cameras or sound cards and would want to combine them (say, a source playing music and then some audience sound from a camera). However, unless you are lucky enough to have a professional-grade setup where everything runs off the same clock (and separate clock source cables run to every device), they won't be in sync; sample clocks are good, but they are not perfect, and they have e.g. some temperature variance. Say we have really good clocks and they only differ by 0.01%; this means that after an hour of streaming, we have 360 ms delay, completely ruining lip sync! This means we'll need to resample at least one of the sources to match the other; that is, play one of them faster or slower than it came in originally.

There are two problems here: How do you determine how much to resample the signals, and how do we resample them? The former is a difficult problem in its own right; about every algorithm not backed by solid control theory is doomed to fail in one way or another, and when they fail, it's extremely annoying to listen to. Nageru follows a 2012 paper by Fons Adriaensen; GStreamer does, well, something else. It fails pretty badly in a number of cases; see e.g. this 2015 master's thesis that tries to patch it up. However, let's ignore this part of the problem for now and focus on the resampling.

So let's look at the case where we've determined we have a signal and need to play it 0.01% faster (or slower); in a real situation, this number would vary a bit (clocks are not even consistently wrong). This means that at some point, we want to output sample number 3000 and that corresponds to input sample number 3000.3, ie., we need to figure out what's between two input samples. As with so many other things, there's a way to do this that's simple, obvious and wrong, namely linear interpolation. The basis of linear interpolation is to look at the two neighboring samples and weigh them according to the position we want. If we need sample 3000.3, we calculate y = 0.7 x[3000] + 0.3 x[3001] (don't switch the two coefficients!), or, if we want to save one multiplication and get better numerical behavior, we can use the equivalent y = x[3000] + 0.3 (x[3001] - x[3000]). And if we need sample 5000.5, we take y = 0.5 x[5000] + 0.5 x[5001]. And after a while, we'll be back on integer samples; output sample 10000 corresponds to x[10001] exactly.

By now, I guess it should be obvious what's going on: We're creating a filter! Linear interpolation will inevitably result in high frequencies being dampened; and even worse, we are creating a time-varying filter, which means that the amount of dampening will vary over time. This manifests itself as a kind of high-frequency "flutter", where the amount of flutter depends on the relative resampling frequencies. (There's a small numeric demonstration of this at the end of this post.) There's also cubic resampling (which can mean any of several different algorithms), but it only really reduces the problem, it doesn't really solve it. The proper way of interpolating depends a lot on exactly what you want (e.g., whether you intend to change the rate quickly or not); this paper lays out a bunch of them, and was the paper that originally made me understand why linear interpolation is so bad. Nageru outsources this problem to zita-resampler, again by Fons Adriaensen; it yields extremely high-quality resampling under controlled delay, through a relatively common technique known as polyphase filters. Unfortunately, doing this kind of calculation takes CPU.
Not a lot of CPU, but Nageru runs in rather CPU-challenged environments (ultraportable laptops where the GPU wants most of the TDP, and the CPU has to go down to the lowest frequency), and it is moving in a direction where it needs to resample many more channels (more on that later), so every bit of CPU helps. So I coded up an SSE optimization of the inner loop for a particular common case (stereo signals) and sent it in for upstream inclusion. (It made the code 2.7 times as fast without any structural changes or reducing precision, which is pretty much what you can expect from SSE.) Unfortunately, after a productive discussion, suddenly upstream went silent. I tried pinging, pinging again, and after half a year pinging again, but to no avail. I filed the patch in Debian's BTS, but the maintainer understandably is reluctant to carry a delta against upstream. I also can't embed a copy; Debian policy would dictate that I build against the system's zita-resampler.

I could work around it by rewriting zita-resampler until it looks nothing like the original, which might be a good idea anyway if I wanted to squeeze out the last drops of speed; there are AVX optimizations to be had in addition to SSE, and the structure as-is isn't ideal for SSE optimizations (although some of the changes I have in mind would have to be offset against increased L1 cache footprint, so careful benchmarking would be needed). But in a sense, it feels like just working around a policy that's there for good reason. So like I said, I'm in a bit of a bind. Maybe I should just buy a faster laptop.

Oh, and how does GStreamer solve this? Well, it doesn't use linear interpolation. It does something even worse: it uses nearest neighbor. Gah.

Update: I was asked to clarify that this is about the audio resampling done by the GStreamer audio sink to sync signals, not in the audioresample element, which solves a related but different problem (static sample rate conversion). The audioresample element supports a number of different resampling methods; I haven't evaluated them.
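
Here is the small numeric demonstration promised above: it computes the gain of the y[n] = x[n] + x[n-1] filter from the start of the post at 50 Hz and 22 kHz, and the gain of linear interpolation at a few fractional offsets, which is exactly the time-varying treble loss described above. (The code was written for this post; it is not taken from Nageru or zita-resampler.)

#include <complex>
#include <cstdio>
#include <initializer_list>

const double kPi = 3.14159265358979323846;
const double kSampleRate = 48000.0;

// Gain of a two-tap filter y[n] = a*x[n] + b*x[n-1] at a given frequency.
double gain(double a, double b, double freq_hz)
{
        double omega = 2.0 * kPi * freq_hz / kSampleRate;
        return std::abs(a + b * std::exp(std::complex<double>(0.0, -omega)));
}

int main()
{
        // The filter from the start of the post: y[n] = x[n] + x[n-1].
        printf("x[n] + x[n-1] at    50 Hz: gain %.3f (boosted)\n", gain(1.0, 1.0, 50.0));
        printf("x[n] + x[n-1] at 22 kHz:   gain %.3f (nearly gone)\n", gain(1.0, 1.0, 22000.0));

        // Linear interpolation at fractional offset t is (1-t)*x[n] + t*x[n+1],
        // ie. the same kind of two-tap filter. Its treble loss depends on t,
        // and t changes all the time during resampling: hence the flutter.
        for (double t : { 0.0, 0.3, 0.5 }) {
                printf("linear interpolation, offset %.1f, at 22 kHz: gain %.3f\n",
                       t, gain(1.0 - t, t, 22000.0));
        }
        return 0;
}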

27 July 2016

Steinar H. Gunderson: Nageru in Debian

Uploading to ftp-master (via ftp to ftp.upload.debian.org):
Uploading nageru_1.3.3-1.dsc: done.
Uploading nageru_1.3.3.orig.tar.gz: done.
Uploading nageru_1.3.3-1.debian.tar.xz: done.
Uploading nageru-dbgsym_1.3.3-1_amd64.deb: done.
Uploading nageru_1.3.3-1_amd64.deb: done.
Uploading nageru_1.3.3-1_amd64.changes: done.
So now it's in the NEW queue, along with its dependency bmusb. Let's see if I made any fatal mistakes in release preparation :-) Edit: Whoa, that was fast; ACCEPTED into unstable.
